1 More in-depth: PAM clustering based on Sørensen-distance matrices


Background

Microbiome sample clustering can be performed using either model-based methods and machine learning methods. Machine learning methods, which rely on defined distance metrics, are used more frequently than model-based statistical methods (“due to their efficient implementation and easy interpretation.”)
- I used the partition around medoids (PAM) clustering method, which is related to but considered more robust than K-means. In contrast to K-means, which can be sensitive to the effects of outliers, PAM’s optimization goal is to minimize the sum of distances to the medoids instead of minimizing the sum of the squared distances to the cluster centers.

Note: clustering was performed directly on distance matrices, not ordinations or ordination scores! This was done to eliminate the introduction of artifacts arising from clustering ordination scores.

1.1 Clustering evaluation: Gap Statistic


Methods

Partitioning methods, such as PAM clustering require the number of clusters to be generated as an input value. Here we are viewing the gap statistic as calculated by the function cluster::clusGap(). The optimal number of clusters was specified using the “Tibs2001SEmax” method, Tibshirani et al’s recommendation of determining the number of clusters from the gap statistics and their standard deviations.

clusGap() calculates a goodness of clustering measure, the “gap” statistic. For each number of clusters \(k\), it compares \(log(W(k))\) with \(E∗[log(W(k))]\) where the latter is defined via bootstrapping, i.e., simulating from a reference (\(H_0\)) distribution, a uniform distribution on the hypercube determined by the ranges of x, after first centering, and then svd (aka ‘PCA’)-rotating them when (as by default) spaceH0 = "scaledPCA".

maxSE(f, SE.f) determines the location of the maximum of f, taking a “1-SE rule” into account for the *SE* methods. The default method "firstSEmax" looks for the smallest \(k\) such that its value \(f(k)\) is not more than 1 standard error away from the first local maximum. This is similar but not the same as "Tibs2001SEmax", Tibshirani et al’s recommendation of determining the number of clusters from the gap statistics and their standard deviations.

References
Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of data clusters via the Gap statistic. Journal of the Royal Statistical Society B, 63, 411–423.

Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.

Dudoit, S. and Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3(7). [doi:10.1186/gb-2002-3-7-research0036\\](doi:10.1186/gb-2002-3-7-research0036\

Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. https://bioconductor.org/packages/3.12/bioc/html/SAGx.html Deprecated and removed from Bioc ca. 2022

Fun


We found the optimal number of clusters for the Sørensen-based dissimilarities of the fungal communities to be 10.




Bac


We found the optimal number of clusters for the Sørensen-based dissimilarities of the bacterial communities to be 14.




1.2 View clustering results overlaid onto NMDS plots


Sørensen-based clusters


Fun

Note the density of cluster 1 (now F1) - I’ll investigate that further below. Fungal clusters have now been relabeled in an intuitive order based on decreasing similiarities to F1.


All samples


Facet by Lat


Facet by Long


Facet by Cluster, color Site

We do see clusters with only 1 site and others with multiple sites.





2 adonis2(. ~ metadata, strata = clusters)


2.1 Explanatory power of cluster designations


Fun


Df SumOfSqs R2 F Pr(>F)
Fun_sor_clusters 9 56.9 0.34 27.12 0.001
Residual 474 110.5 0.66 NA NA
Total 483 167.4 1.00 NA NA

Across the whole dataset, the fungal clustering explained 33.99% of the community variability (presence/absence).


Bac


TBD

Across the whole dataset, the bacterial clustering explained 33.99% of the community variability (presence/absence).


2.2 Explanatory power of metadata within clusters


Fun


Note that in fungal cluster “F2” there were only 10 samples (all site=SFA, grass=SCSC), and because enviromental data was measured on a pooled site-by-grass bases, no underlying variability exists to perform stats on.


whole community

Table 2.1: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wholecommunity.pH_m.std 1 5.766 0.034 17.188 0.001 Fun_wholecommunity pH
Fun_wholecommunity.soil_moisture_m.std 1 5.695 0.034 16.969 0.001 Fun_wholecommunity soil_moisture
Fun_wholecommunity.SOM_m.std 1 6.082 0.036 18.167 0.001 Fun_wholecommunity SOM
Fun_wholecommunity.ammonium_m.std 1 6.447 0.039 19.303 0.001 Fun_wholecommunity ammonium
Fun_wholecommunity.phos_m.std 1 6.227 0.037 18.616 0.001 Fun_wholecommunity phos
Climate
Fun_wholecommunity.ppt3yr_m.std 1 6.046 0.036 18.057 0.001 Fun_wholecommunity ppt3yr
Fun_wholecommunity.GDD3yr_m.std 1 2.574 0.015 7.526 0.001 Fun_wholecommunity GDD3yr
Plant
Fun_wholecommunity.avg_SRL_m.std 1 1.847 0.011 5.375 0.001 Fun_wholecommunity avg_SRL
Fun_wholecommunity.avg_SLA_m.std 1 3.154 0.019 9.253 0.001 Fun_wholecommunity avg_SLA
Fun_wholecommunity.herbivory_perc_m.std 1 2.380 0.014 6.949 0.001 Fun_wholecommunity herbivory_perc


F1

Table 2.2: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F1.pH_m.std 1 1.537 0.063 12.119 0.001 Fun_wc_F1 pH
Fun_wc_F1.soil_moisture_m.std 1 1.203 0.049 9.347 0.001 Fun_wc_F1 soil_moisture
Fun_wc_F1.SOM_m.std 1 1.529 0.062 12.055 0.001 Fun_wc_F1 SOM
Fun_wc_F1.ammonium_m.std 1 1.193 0.049 9.271 0.001 Fun_wc_F1 ammonium
Fun_wc_F1.phos_m.std 1 1.230 0.050 9.572 0.001 Fun_wc_F1 phos
Climate
Fun_wc_F1.ppt3yr_m.std 1 1.863 0.076 14.898 0.001 Fun_wc_F1 ppt3yr
Fun_wc_F1.GDD3yr_m.std 1 1.677 0.068 13.301 0.001 Fun_wc_F1 GDD3yr
Plant
Fun_wc_F1.avg_SRL_m.std 1 0.655 0.027 4.976 0.001 Fun_wc_F1 avg_SRL
Fun_wc_F1.avg_SLA_m.std 1 1.076 0.044 8.319 0.001 Fun_wc_F1 avg_SLA
Fun_wc_F1.herbivory_perc_m.std 1 0.578 0.024 4.373 0.001 Fun_wc_F1 herbivory_perc


F3

Table 2.3: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F3.pH_m.std 1 1.664 0.223 7.751 0.001 Fun_wc_F3 pH
Fun_wc_F3.soil_moisture_m.std 1 0.759 0.102 3.059 0.002 Fun_wc_F3 soil_moisture
Fun_wc_F3.SOM_m.std 1 1.627 0.218 7.534 0.001 Fun_wc_F3 SOM
Fun_wc_F3.ammonium_m.std 1 0.426 0.057 1.636 0.060 Fun_wc_F3 ammonium
Fun_wc_F3.phos_m.std 1 0.948 0.127 3.931 0.001 Fun_wc_F3 phos
Climate
Fun_wc_F3.ppt3yr_m.std 1 1.676 0.225 7.827 0.001 Fun_wc_F3 ppt3yr
Fun_wc_F3.GDD3yr_m.std 1 1.676 0.225 7.827 0.001 Fun_wc_F3 GDD3yr
Plant
Fun_wc_F3.avg_SRL_m.std 1 1.488 0.200 6.730 0.001 Fun_wc_F3 avg_SRL
Fun_wc_F3.avg_SLA_m.std 1 1.670 0.224 7.789 0.001 Fun_wc_F3 avg_SLA
Fun_wc_F3.herbivory_perc_m.std 1 0.645 0.087 2.557 0.005 Fun_wc_F3 herbivory_perc


F4

Table 2.4: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F4.pH_m.std 1 0.455 0.067 1.929 0.002 Fun_wc_F4 pH
Fun_wc_F4.soil_moisture_m.std 1 0.352 0.052 1.469 0.012 Fun_wc_F4 soil_moisture
Fun_wc_F4.SOM_m.std 1 0.483 0.071 2.060 0.002 Fun_wc_F4 SOM
Fun_wc_F4.ammonium_m.std 1 0.352 0.052 1.469 0.009 Fun_wc_F4 ammonium
Fun_wc_F4.phos_m.std 1 0.358 0.052 1.495 0.010 Fun_wc_F4 phos
Climate
Fun_wc_F4.ppt3yr_m.std 1 0.409 0.060 1.724 0.068 Fun_wc_F4 ppt3yr
Fun_wc_F4.GDD3yr_m.std 1 0.396 0.058 1.666 0.013 Fun_wc_F4 GDD3yr
Plant
Fun_wc_F4.avg_SRL_m.std 1 0.468 0.069 1.989 0.001 Fun_wc_F4 avg_SRL
Fun_wc_F4.avg_SLA_m.std 1 0.379 0.056 1.590 0.006 Fun_wc_F4 avg_SLA
Fun_wc_F4.herbivory_perc_m.std 1 0.430 0.063 1.820 0.005 Fun_wc_F4 herbivory_perc


F5

Table 2.5: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F5.pH_m.std 1 0.720 0.156 2.779 0.001 Fun_wc_F5 pH
Fun_wc_F5.soil_moisture_m.std 1 0.645 0.140 2.442 0.001 Fun_wc_F5 soil_moisture
Fun_wc_F5.SOM_m.std 1 0.707 0.154 2.722 0.001 Fun_wc_F5 SOM
Fun_wc_F5.ammonium_m.std 1 0.673 0.146 2.565 0.001 Fun_wc_F5 ammonium
Fun_wc_F5.phos_m.std 1 0.651 0.141 2.467 0.001 Fun_wc_F5 phos
Climate
Fun_wc_F5.ppt3yr_m.std 1 0.368 0.080 1.303 0.119 Fun_wc_F5 ppt3yr
Fun_wc_F5.GDD3yr_m.std 1 0.402 0.087 1.435 0.041 Fun_wc_F5 GDD3yr
Plant
Fun_wc_F5.avg_SRL_m.std 1 0.397 0.086 1.414 0.053 Fun_wc_F5 avg_SRL
Fun_wc_F5.avg_SLA_m.std 1 0.691 0.150 2.648 0.001 Fun_wc_F5 avg_SLA
Fun_wc_F5.herbivory_perc_m.std 1 0.399 0.087 1.422 0.063 Fun_wc_F5 herbivory_perc


F6

Table 2.6: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F6.pH_m.std 1 0.738 0.085 2.797 0.001 Fun_wc_F6 pH
Fun_wc_F6.soil_moisture_m.std 1 0.551 0.064 2.039 0.004 Fun_wc_F6 soil_moisture
Fun_wc_F6.SOM_m.std 1 0.496 0.057 1.822 0.007 Fun_wc_F6 SOM
Fun_wc_F6.ammonium_m.std 1 0.538 0.062 1.989 0.001 Fun_wc_F6 ammonium
Fun_wc_F6.phos_m.std 1 0.395 0.046 1.435 0.030 Fun_wc_F6 phos
Climate
Fun_wc_F6.ppt3yr_m.std 1 0.594 0.069 2.210 0.001 Fun_wc_F6 ppt3yr
Fun_wc_F6.GDD3yr_m.std 1 1.088 0.126 4.313 0.001 Fun_wc_F6 GDD3yr
Plant
Fun_wc_F6.avg_SRL_m.std 1 0.703 0.081 2.651 0.001 Fun_wc_F6 avg_SRL
Fun_wc_F6.avg_SLA_m.std 1 0.395 0.046 1.435 0.035 Fun_wc_F6 avg_SLA
Fun_wc_F6.herbivory_perc_m.std 1 0.561 0.065 2.078 0.001 Fun_wc_F6 herbivory_perc


F7

Table 2.7: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F7.pH_m.std 1 1.035 0.099 3.962 0.001 Fun_wc_F7 pH
Fun_wc_F7.soil_moisture_m.std 1 0.753 0.072 2.802 0.001 Fun_wc_F7 soil_moisture
Fun_wc_F7.SOM_m.std 1 1.198 0.115 4.670 0.001 Fun_wc_F7 SOM
Fun_wc_F7.ammonium_m.std 1 0.582 0.056 2.125 0.001 Fun_wc_F7 ammonium
Fun_wc_F7.phos_m.std 1 0.362 0.035 1.293 0.107 Fun_wc_F7 phos
Climate
Fun_wc_F7.ppt3yr_m.std 1 1.199 0.115 4.671 0.001 Fun_wc_F7 ppt3yr
Fun_wc_F7.GDD3yr_m.std 1 1.008 0.097 3.848 0.001 Fun_wc_F7 GDD3yr
Plant
Fun_wc_F7.avg_SRL_m.std 1 0.801 0.077 2.993 0.001 Fun_wc_F7 avg_SRL
Fun_wc_F7.avg_SLA_m.std 1 0.790 0.076 2.948 0.001 Fun_wc_F7 avg_SLA
Fun_wc_F7.herbivory_perc_m.std 1 0.622 0.060 2.283 0.001 Fun_wc_F7 herbivory_perc


F8

Table 2.8: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F8.pH_m.std 1 1.317 0.076 4.176 0.001 Fun_wc_F8 pH
Fun_wc_F8.soil_moisture_m.std 1 1.244 0.071 3.927 0.001 Fun_wc_F8 soil_moisture
Fun_wc_F8.SOM_m.std 1 1.586 0.091 5.116 0.001 Fun_wc_F8 SOM
Fun_wc_F8.ammonium_m.std 1 1.038 0.060 3.235 0.001 Fun_wc_F8 ammonium
Fun_wc_F8.phos_m.std 1 1.135 0.065 3.558 0.001 Fun_wc_F8 phos
Climate
Fun_wc_F8.ppt3yr_m.std 1 1.924 0.111 6.343 0.001 Fun_wc_F8 ppt3yr
Fun_wc_F8.GDD3yr_m.std 1 1.370 0.079 4.358 0.001 Fun_wc_F8 GDD3yr
Plant
Fun_wc_F8.avg_SRL_m.std 1 0.891 0.051 2.754 0.001 Fun_wc_F8 avg_SRL
Fun_wc_F8.avg_SLA_m.std 1 1.066 0.061 3.330 0.001 Fun_wc_F8 avg_SLA
Fun_wc_F8.herbivory_perc_m.std 1 1.116 0.064 3.495 0.001 Fun_wc_F8 herbivory_perc


F9

Table 2.9: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F9.pH_m.std 1 0.727 0.041 2.328 0.001 Fun_wc_F9 pH
Fun_wc_F9.soil_moisture_m.std 1 0.928 0.053 3.005 0.001 Fun_wc_F9 soil_moisture
Fun_wc_F9.SOM_m.std 1 1.340 0.076 4.450 0.001 Fun_wc_F9 SOM
Fun_wc_F9.ammonium_m.std 1 0.914 0.052 2.957 0.001 Fun_wc_F9 ammonium
Fun_wc_F9.phos_m.std 1 0.808 0.046 2.599 0.001 Fun_wc_F9 phos
Climate
Fun_wc_F9.ppt3yr_m.std 1 1.227 0.070 4.048 0.001 Fun_wc_F9 ppt3yr
Fun_wc_F9.GDD3yr_m.std 1 1.064 0.060 3.473 0.001 Fun_wc_F9 GDD3yr
Plant
Fun_wc_F9.avg_SRL_m.std 1 0.650 0.037 2.071 0.001 Fun_wc_F9 avg_SRL
Fun_wc_F9.avg_SLA_m.std 1 0.703 0.040 2.246 0.001 Fun_wc_F9 avg_SLA
Fun_wc_F9.herbivory_perc_m.std 1 0.789 0.045 2.534 0.001 Fun_wc_F9 herbivory_perc


F10

Table 2.10: adonis2 outputs for single-term models
Df SumOfSqs R2 F Pr(>F) comm expl_var
Edaphic
Fun_wc_F10.pH_m.std 1 0.868 0.074 2.815 0.001 Fun_wc_F10 pH
Fun_wc_F10.soil_moisture_m.std 1 1.072 0.092 3.541 0.001 Fun_wc_F10 soil_moisture
Fun_wc_F10.SOM_m.std 1 0.970 0.083 3.175 0.001 Fun_wc_F10 SOM
Fun_wc_F10.ammonium_m.std 1 1.097 0.094 3.635 0.001 Fun_wc_F10 ammonium
Fun_wc_F10.phos_m.std 1 1.113 0.095 3.693 0.001 Fun_wc_F10 phos
Climate
Fun_wc_F10.ppt3yr_m.std 1 0.753 0.065 2.416 0.001 Fun_wc_F10 ppt3yr
Fun_wc_F10.GDD3yr_m.std 1 0.797 0.068 2.567 0.001 Fun_wc_F10 GDD3yr
Plant
Fun_wc_F10.avg_SRL_m.std 1 0.738 0.063 2.362 0.001 Fun_wc_F10 avg_SRL
Fun_wc_F10.avg_SLA_m.std 1 0.768 0.066 2.466 0.001 Fun_wc_F10 avg_SLA
Fun_wc_F10.herbivory_perc_m.std 1 0.814 0.070 2.626 0.001 Fun_wc_F10 herbivory_perc




Bac



TBD



3 Cluster descriptions


[in-progress]

Descriptions of Sørensen distance-based clusters. †variables listed are significant in all vs. base-mean Wilcox test with BH p-value corrections. * p-value < 0.05, ** < 0.001, *** < 0.0001, **** < 1e-5.
Cluster total n Site(s) Grass(es) Characteristics† Exclusivity
1 183 spans pH range
2 37 BNP,DMT,FMT,SEV BOER (n=33), BOGR (n=4) high pH* (6.8 - 8.3, mean 7.5)
3 56 high pH**** (6.8 - 8.3, mean 7.8)
4 17 lower pH**** (5.6 - 7.6, mean 6.2)
5 43 high pH** (7.1 - 7.8, mean 7.6)
6 32 lower pH**** (5.1 - 7.8, mean 6.2)
7 38
8 29 mostly LAR (n=27) high pH**** (7.2 - 8.2, mean 8.0)
9 9 SFA SCSC (only grass present) Site=SFA
10 29 KAE lower pH**** (5.9 - 7.2, mean 6.3) Site=KAE (of the 32 KAE samples, only 3 others were in diff clusters)




4 Are the clusters distinct/distinguishable? -> yes, mostly

Clusters: Sørensen dissimilarity clusters based on pam (k = 10)

Method: R package randomForest v4.7.1.1

predictors.all<-t(otu_table(Fun_wholecommunity))

response.clus_sor_k10<-as.factor(sample_data(Fun_wholecommunity)$clus_sor_k10_new)

rf.data.clus_sor_k10<-data.frame(response.clus_sor_k10, predictors.all)

classify.clus_sor_k10<-randomForest(response.clus_sor_k10~., data = rf.data.clus_sor_k10, ntree=999)

Call: randomForest(formula = response.clus_sor_k10 ~ ., data = rf.data.clus_sor_k10, ntree = 999)

Type of random forest: classification

Number of trees: 999

No. of variables tried at each split: 81

OOB estimate of error rate: 6.82%

Confusion matrix of fungal whole community randomForest classification based on Sørensen-based PAM clusters. Diagonals are accurate calls. Misclassifications are shown as values by row, i.e., cluster F3 was misclassified as cluster F8 in 1 sample, while 28 samples were accurately called, resulting in an error rate of 3.4% for that cluster.
F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 Cluster error %
F1 183 0.0
F2 10 0.0
F3 28 1 3.4
F4 1 27 1 6.9
F5 12 1 4 29.4
F6 23 2 7 28.1
F7 1 2 34 1 10.5
F8 1 1 1 50 5.7
F9 1 1 49 5 12.5
F10 2 35 5.4




adonis2

## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = phyloseq::distance(Fun_wholecommunity, method = "bray") ~ clus_sor_k10_new, data = (as(sample_data(Fun_wholecommunity), "data.frame")))
##                   Df SumOfSqs      R2      F Pr(>F)    
## clus_sor_k10_new   9   40.489 0.20715 13.761  0.001 ***
## Residual         474  154.968 0.79285                  
## Total            483  195.457 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




4.1 Dendrogram of Sørensen-based clusters (k = 10)


Fungal whole community

Dendrogram of Sørensen-based clusters of fungal whole communities. The terminal segments hang to within-cluster dissimilarity.
Dendrogram of Sørensen-based clusters of fungal whole communities. The terminal segments hang to within-cluster dissimilarity.

Clusters 2 & 3 seem to be the outgroup / most distantly related
Clusters 1 & 9 are very similar, followed by 10 -> wrap-around cluster numbering

4.2 Which OTUs are driving the differences between clusters 1 & 2?


SIMPER results


average - Species contribution to average between-group dissimilarity
cusum - Ordered cumulative contribution (to between-group dissimilarity). These are based on item average, but they sum up to total 1
p - Permutation \(p\)-value. Probability of getting a larger or equal average contribution in random permutation of the group factor


Mean Sørensen dissimilarity between Clusters 1 & 2: 0.920 (unweighted) / 0.903 (weighted by sample size)

Mean within-cluster similarities
- Cluster 1: 0.514
- Cluster 2: 0.802


5 Significantly differential taxa among clusters


5.1 Genus-level


Fusarium



5.2 OTU-level


OTU1252



OTU6679



OTU205






6 Taxonomy of clusters



7 Fungal

Phylum


Class


Order


Family


Generalists v Specialists